Abstract

Transcription factors (TFs) are regulatory proteins that bind DNA in promoter regions of the 
genome and either promote or repress gene expression. Here we predict analytically that enhanced 
homo-oligonucleotide sequence correlations, such as poly(dA:dT) and poly(dC:dG) tracts, statis- 
tically enhance non-specific TF-DNA binding affinity. This prediction is generic and qualitatively 
independent of microscopic parameters of the model. We show that non-specific TF binding affin- 
ity is universally controlled by the strength and symmetry of DNA sequence correlations. We 
perform correlation analysis of the yeast genome and show that DNA regions highly occupied by 
TFs exhibit stronger homo-oligonucleotide sequence correlations, and thus higher propensity for 
non-specific binding, as compared with poorly occupied regions. We suggest that this effect plays 
the role of an effective localization potential enhancing the quasi-one-dimensional diffusion of TFs 
in the vicinity of DNA, speeding up the stochastic search process for specific TF binding sites. The 
predicted effect also imposes an upper bound on the size of TF-DNA binding motifs. 

Keywords: Promiscuity of transcription factor-DNA binding; Free energy of transcription factor-DNA bind- 
ing 
I. INTRODUCTION



Transcription factors (TFs) are proteins that regulate gene expression in both prokaryotic 
(e.g. bacteria) and eukaryotic (e.g. yeast or human) cells. TFs bind regulatory promoter
regions of DNA in the genome. It is commonly accepted that each transcription factor
binds specifically a relatively small set of DNA sequences called TF binding motifs or TF
binding sites (TFBSs). A TF binds its specific binding motifs with a higher affinity than 
other genomic sequences of the same length [H, A typical length of TF binding motif 
varies between 6 and 20 nucleotides. Recent high-throughput measurements of TF binding 
preferences on a genome-wide scale have challenged the classical picture of TF specificity 
[31,0]. These experiments measured binding preferences of more than a hundred transcription 
factors to tens of thousands of DNA sequences and demonstrated a high level of
multi-specificity in TF binding [sl, 0|. It has also been pointed out that weak-affinity TF
binding motifs are essential for gene expression regulation jsf.

A key question is how TFs find their specific binding sites in a background of 10^ — 10^ 
non-specific sites in a cell genome. This question was first addressed theoretically in sem- 
inal works of Berg, Winter, and von Hippel 0, g. The central idea of this approach is 
that the search process is a combination of three-dimensional and one-dimensional diffusion
(see 10| for recent reviews). It has been shown in different theoretical models that
one-dimensional diffusion (termed 'sliding' or 'hopping', depending on the model) facilitates the
search process under certain conditions [11-17]. Despite the success of these phenomenological
models, a complete understanding of the search process is still lacking [8]. In
particular, one of the key open questions is what makes a TF switch from three-dimensional
diffusion to one-dimensional sliding in specific genomic locations jsj. Invariably, an
assumption is made about the existence of some non-specific binding sites that bring TFs to the
vicinity of DNA for one-dimensional sliding. This assumption is a key component of all
theoretical models, yet the molecular origin of this effect is not understood jl, Recent
single-molecule experimental studies undoubtedly show that different DNA-binding proteins
spend the majority of their time non-specifically bound and diffusing along DNA [18-22].
The question is what biophysical mechanism provides such non-specific attraction towards
genomic DNA and regulates the strength of this attraction at a given genomic location?

Here we predict that DNA sequence correlations statistically regulate non-specific TF- 
DNA binding preferences. Depending on the symmetry and length-scale of sequence correla- 
tions, the non-specific binding affinity can be either enhanced or reduced. In particular, we 
show that homo-oligonucleotide sequence correlations, where nucleotides of the same type
are clustered together, generically reduce the non-specific TF-DNA binding free energy, thus
enhancing the binding affinity (Fig. 1). Sequence correlations where nucleotides of different
types alternate lead to the opposite effect, increasing the non-specific TF-DNA binding
free energy (Fig. 1). Correlation analysis of the yeast genome regulatory sequences suggests
that the predicted design principle is exploited at the genome-wide level, in order to increase 
the strength of non-specific binding at these regulatory genomic locations. 

The paper is organized as follows. First, we present a simple, analytically solvable model 
describing TF-DNA binding. This model uses two-nucleotide alphabet DNA sequences. We 
develop a stochastic procedure allowing us to 'design' DNA sequences with a controlled 
symmetry and strength of sequence correlations. We analyze the free energy of non-specific 
TF-DNA binding within the framework of this model, and give an intuitive explanation for 
the origin of the predicted effect. Second, we generalize the model to four-letter alphabet
DNA sequences and show that all key conclusions hold qualitatively true in this case as
well. Third, we compute the free energy of non-specific TF-DNA binding for yeast genomic
sequences, and show that sequences highly occupied by TFs in vivo possess a statistically
higher propensity for non-specific binding to TFs, compared with sequences depleted in TFs.
Finally, we conclude and propose experiments allowing a direct test of the predicted effect.



II. THEORY AND RESULTS 

A. Free energy of non-specific TF-DNA binding in model sequences 

In this work we use a simple variant of the Berg-von Hippel model to describe TF-DNA
binding. For the analytical treatment we apply the model to artificial DNA sequences
containing two types of nucleotides only, rather than four. However, we show that all key 
conclusions hold qualitatively true for four-nucleotide alphabet sequences, as well. 

The energy of a TF bound to DNA at a specific location $i$ (see Fig. 1) is:

$$ U(i) = -K \sum_{j=i}^{M+i-1} \sigma_j , \qquad (1) $$

where $i$ and $j$ label individual base-pairs, $M$ is the effective length of the TF (i.e. the
number of contacts between TF and DNA), $\sigma_j = \pm 1$ describes the two possible nucleotide
types at each position $j$, and $K$ is the interaction strength. We therefore assume that the energy
contributions of individual base-pairs to the total binding energy, $U(i)$, are additive. We
also assume that the energy of each contact is exclusively defined by the base-pair type. The
sequence of a DNA molecule of length $L$ is uniquely defined by the set of $L$ numbers, $\sigma_j$,
with $j = 1 \ldots L$.

We note that Eq. (1) provides a minimal model for TF-DNA binding. It captures
the recognition specificity of a TF in the simplest possible way, by assigning different contact
energies, $+K$ and $-K$, to the two possible nucleotide types. In reality, a TF recognizes
DNA motifs forming a complex, cooperative network of hydrogen and electrostatic bonds 
[H, Yet we suggest that the design principle for enhanced non-specific TF-DNA binding 
predicted using such a simplified model, is likely to be quite general and robust with respect 
to microscopic details of TF-DNA interactions. 

The free energy of binding of an individual TF to DNA is given by $\mathcal{F} = -k_B T \ln Z$, with
the partition function:

$$ Z = \sum_{i=1}^{L} \exp\left(-U(i)/k_B T\right), \qquad (2) $$

where $k_B$ is the Boltzmann constant, $T$ is the absolute temperature, and we imply periodic
boundary conditions. We ask: what are the statistical properties of $\mathcal{F}$ as a
function of the symmetry and strength of DNA sequence correlations?
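
As an illustration of Eqs. (1) and (2), the following minimal Python sketch evaluates the binding energy profile, the partition function, and the free energy for a two-letter sequence with periodic boundary conditions. The function names, the two example sequences, and the parameter values are illustrative assumptions and are not taken from the calculations reported here; energies are measured in units of $k_B T$.

```python
import math

def binding_energies(sigma, M, K):
    """U(i) = -K * sum_{j=i}^{i+M-1} sigma_j with periodic boundaries (Eq. 1)."""
    L = len(sigma)
    return [-K * sum(sigma[(i + j) % L] for j in range(M)) for i in range(L)]

def free_energy(sigma, M, K, kT=1.0):
    """F = -kT ln Z, with Z = sum_i exp(-U(i)/kT) (Eq. 2)."""
    U = binding_energies(sigma, M, K)
    Z = sum(math.exp(-u / kT) for u in U)
    return -kT * math.log(Z)

# Hypothetical example: +1 and -1 label the two nucleotide types.
clustered = [1] * 6 + [-1] * 6      # homo-oligonucleotide stretches
alternating = [1, -1] * 6           # alternating nucleotides
F_clustered = free_energy(clustered, M=4, K=1.0)
F_alternating = free_energy(alternating, M=4, K=1.0)
```

In this toy comparison both sequences have the same composition, so any difference between `F_clustered` and `F_alternating` comes from the arrangement of the nucleotides alone.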

In order to answer this question, we first 'design' DNA sequences using a stochastic design
procedure. This procedure allows nucleotides within a DNA sequence to anneal, with each
configuration being accepted with the Boltzmann probability:

$$ p(E_d) = \frac{\exp\left(-E_d/k_B T_d\right)}{Z_d}, \qquad (3) $$

where $T_d$ is the 'design' temperature controlling the strength of correlations (this is different
from the thermodynamic temperature, $T$), and $E_d$ is the 'design', intra-DNA energy. For
simplicity, we take into account only the nearest-neighbor interactions in the 'design' energy:

$$ E_d = -J \sum_{i=1}^{L} \sigma_i \sigma_{i+1}, \qquad (4) $$

with $J$ being the 'design', intra-sequence interaction strength, and $Z_d$ is the corresponding
Ising model partition function [23]:

$$ Z_d = 2^L \left( \cosh^L(\beta_d J) + \sinh^L(\beta_d J) \right), \qquad (5) $$

where $\beta_d = 1/k_B T_d$.
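
The design procedure can be sketched as a Metropolis Monte Carlo simulation of the nearest-neighbor design energy on a ring. This is a minimal sketch under stated assumptions: single-site flips are used as the move set (the move set is not specified above for the two-letter model), and the function names, sequence length, number of sweeps, and seed are hypothetical choices made only for illustration.

```python
import math
import random

def design_sequence(L, beta_d_J, sweeps=200, seed=0):
    """Metropolis sampling of sigma_j = +/-1 on a ring with weight exp(-beta_d * E_d),
    E_d = -J * sum_i sigma_i * sigma_{i+1} (single-site flips; an assumed move set)."""
    rng = random.Random(seed)
    sigma = [rng.choice((-1, 1)) for _ in range(L)]
    for _ in range(sweeps * L):
        i = rng.randrange(L)
        # Change in beta_d * E_d if sigma_i is flipped.
        delta = 2.0 * beta_d_J * sigma[i] * (sigma[i - 1] + sigma[(i + 1) % L])
        if delta <= 0 or rng.random() < math.exp(-delta):
            sigma[i] = -sigma[i]
    return sigma

def neighbor_correlation(sigma):
    """Average of sigma_i * sigma_{i+1} over the ring."""
    L = len(sigma)
    return sum(sigma[i] * sigma[(i + 1) % L] for i in range(L)) / L

ferro = design_sequence(L=400, beta_d_J=1.0)        # J > 0: clustered nucleotides
antiferro = design_sequence(L=400, beta_d_J=-1.0)   # J < 0: alternating nucleotides
```

For a long ring the nearest-neighbor correlation is expected to approach $\tanh(\beta_d J)$, so `neighbor_correlation` gives a quick check that a designed sequence has the intended symmetry.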

The ferromagnetic-like case, $J > 0$, produces sequences with homo-oligonucleotide
stretches. The correlation length, $\xi = -1/\ln(\tanh(\beta_d |J|))$, is the characteristic length-scale of
the correlation decay, $\langle \sigma_i \sigma_{i+x} \rangle = \exp(-x/\xi)$ [23]. The anti-ferromagnetic-like case, $J < 0$,
produces sequences with a different symmetry, that of alternating nucleotides (Fig. 1). We define
the average free energy of TF binding to DNA as the annealed average:

$$ \langle \mathcal{F} \rangle = -\beta^{-1} \ln \langle Z \rangle, \qquad (6) $$

where the averaging is performed with the probability $p(E_d)$, Eq. (3), and $\beta = 1/k_B T$.
The quenched average, $\langle \mathcal{F} \rangle_q = -\beta^{-1} \langle \ln Z \rangle$, is analyzed numerically below, and it gives
qualitatively similar results (Fig. 2). The averaging in Eq. (6) gives:



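$$ \langle Z \rangle = \frac{L\, 2^{L-M-1}}{Z_d} \left[ \left(\lambda_+^M + \lambda_-^M\right) \left(\cosh^{L-M}(\beta_d J) + \sinh^{L-M}(\beta_d J)\right) + \left(\lambda_+^M - \lambda_-^M\right) \left(\cosh^{L-M}(\beta_d J) - \sinh^{L-M}(\beta_d J)\right) \frac{e^{-\beta_d J} \cosh(\beta K)}{\sqrt{e^{2\beta_d J} \sinh^2(\beta K) + e^{-2\beta_d J}}} \right], \qquad (7) $$

where $Z_d$ is given by Eq. (5), and

$$ \lambda_\pm = e^{\beta_d J} \cosh(\beta K) \pm \sqrt{e^{2\beta_d J} \sinh^2(\beta K) + e^{-2\beta_d J}}. \qquad (8) $$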
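
A closed-form result of this kind can be checked against explicit enumeration for a short ring. The sketch below is a hedged illustration, not the procedure used to obtain the results reported here: the closed form it codes follows from a transfer-matrix evaluation of the annealed average with periodic boundary conditions (the ratio multiplying the second term is stated explicitly in the code and should be read as an assumption of this sketch), and the function names and the test values of $L$, $M$, $\beta_d J$ and $\beta K$ are hypothetical.

```python
import itertools
import math

def annealed_Z_closed_form(L, M, bdJ, bK):
    """<Z> from the closed-form transfer-matrix expression (cf. Eqs. 5, 7 and 8)."""
    root = math.sqrt(math.exp(2 * bdJ) * math.sinh(bK) ** 2 + math.exp(-2 * bdJ))
    lam_p = math.exp(bdJ) * math.cosh(bK) + root
    lam_m = math.exp(bdJ) * math.cosh(bK) - root
    c, s, n = math.cosh(bdJ), math.sinh(bdJ), L - M
    Zd = 2 ** L * (c ** L + s ** L)
    bracket = ((lam_p ** M + lam_m ** M) * (c ** n + s ** n)
               + (lam_p ** M - lam_m ** M) * (c ** n - s ** n)
               * math.exp(-bdJ) * math.cosh(bK) / root)
    return L * 2 ** (n - 1) * bracket / Zd

def annealed_Z_brute_force(L, M, bdJ, bK):
    """<Z> by explicit enumeration of all 2^L sequences on a ring."""
    num = den = 0.0
    for sigma in itertools.product((-1, 1), repeat=L):
        # Design weight exp(-beta_d * E_d) of this sequence.
        w = math.exp(bdJ * sum(sigma[i] * sigma[(i + 1) % L] for i in range(L)))
        # Partition function of a TF sliding over all L positions.
        Z = sum(math.exp(bK * sum(sigma[(i + j) % L] for j in range(M)))
                for i in range(L))
        num += w * Z
        den += w
    return num / den
```

The enumeration is only feasible for small $L$ (its cost grows as $2^L$), which is the regime where such a consistency check is useful.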



We argue that the symmetry of DNA sequence correlations statistically affects the interaction free
energy. It is therefore natural to analyze the free energy difference between 'designed'
sequences and their randomized analogs, lacking any symmetry:

$$ \langle \Delta\mathcal{F} \rangle = \langle \mathcal{F} \rangle - \langle \mathcal{F}_\infty \rangle, \qquad (9) $$

where $\langle \mathcal{F}_\infty \rangle$ is the free energy computed for entirely random sequences (i.e. for sequences
designed using a very high value of $T_d$, or equivalently, $1/\beta_d J \gg 1$). The first key property
of $\langle \Delta\mathcal{F} \rangle$ is that it is invariant with respect to the sign of the TF-DNA binding affinity
constant, $K$. Second, $\langle \Delta\mathcal{F} \rangle < 0$ always holds if $J > 0$ (ferromagnetic-like
correlations within designed DNA sequences, see Eq. (4)), and $\langle \Delta\mathcal{F} \rangle > 0$ if $J < 0$
(anti-ferromagnetic-like correlations). Fig. 2 shows the behavior of $\langle \Delta\mathcal{F} \rangle$ at different magnitudes
of the design strength. The central observation here is that the behavior of $\langle \Delta\mathcal{F} \rangle$ critically
depends on the symmetry and the length-scale of DNA sequence correlations. The
presence of homo-oligonucleotide stretches along DNA sequences statistically increases the
propensity of such sequences towards non-specific binding to TFs. The DNA stretches with
alternating nucleotides of different types produce the opposite effect: such sequences will
have a reduced propensity for non-specific binding. We note that the quenched average,
$\langle \Delta\mathcal{F} \rangle_q = -\beta^{-1} \langle \ln(Z/Z_\infty) \rangle$, computed numerically, is in good agreement with the annealed
average (Fig. 2).
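
The two properties stated above (invariance under $K \to -K$, and the sign of the free energy difference being set by the sign of $J$) can be illustrated numerically. The sketch below evaluates the annealed average with 2x2 transfer matrices rather than with a closed-form expression. It assumes a ring with periodic boundary conditions and the annealed random-sequence reference $\langle Z_\infty \rangle = L \cosh^M(\beta K)$; the helper names and the default values $M = 8$, $L = 1000$ are illustrative assumptions.

```python
import math

def matmul(A, B):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def matpow(A, n):
    """n-th power of a 2x2 matrix by repeated squaring."""
    R = [[1.0, 0.0], [0.0, 1.0]]
    while n > 0:
        if n & 1:
            R = matmul(R, A)
        A = matmul(A, A)
        n >>= 1
    return R

def delta_f(bdJ, bK, M=8, L=1000):
    """beta * <Delta F> / M = -ln(<Z> / <Z_inf>) / M for the annealed average."""
    t = math.tanh(bdJ)
    # Transfer matrix of the design energy, divided by 2*cosh(bdJ) to avoid overflow.
    T = [[(1 + t) / 2, (1 - t) / 2], [(1 - t) / 2, (1 + t) / 2]]
    D = [[math.exp(bK), 0.0], [0.0, math.exp(-bK)]]   # Boltzmann weight of one contact
    num = matmul(matpow(matmul(D, T), M), matpow(T, L - M))
    den = matpow(T, L)
    mean_boltzmann = (num[0][0] + num[1][1]) / (den[0][0] + den[1][1])  # <Z> / L
    return -math.log(mean_boltzmann / math.cosh(bK) ** M) / M
```

Calling `delta_f(1.0, 1.0)` and `delta_f(-1.0, 1.0)` gives values of opposite sign, and `delta_f(bdJ, bK)` equals `delta_f(bdJ, -bK)` up to rounding.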

The reduction of the TF-DNA binding free energy by the presence of homo-oligonucleotide
sequence correlations can be understood intuitively in the following way. Homo-oligonucleotide
sequence correlations generically enhance fluctuations of the TF-DNA binding
energy, $\sigma_U^2 = \langle U^2 \rangle - \langle U \rangle^2$. This effect has to do with the symmetry: a TF sliding along
correlated DNA sequences where nucleotides of the same type have the tendency to cluster
will experience homogeneous DNA 'islands', such as poly(dA:dT) and poly(dC:dG) tracts.
Statistically, this leads to the dominant contribution of either very strong or very weak
energies to the TF-DNA binding energy spectrum. This symmetry effect therefore leads to
the widening of the TF-DNA binding energy spectrum, $P(U)$. Such widening generically
leads to the reduction of the TF-DNA binding free energy, because the dominant
contribution to the partition function, $Z$, comes from the low-energy tail of $P(U)$ [23].
Conversely, DNA sequences with enhanced anti-ferromagnetic-like correlations (i.e. with
alternating nucleotides of different types) lead to the opposite effect: a TF sliding along
such a sequence will experience very heterogeneous binding sites. This leads to the narrowing
of $P(U)$, and consequently, to the increase of the non-specific TF-DNA binding free
energy.
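
The widening and narrowing of $P(U)$ can be made concrete with a short calculation of the energy variance. This is a sketch under stated assumptions: it uses the standard nearest-neighbor Ising result $\langle \sigma_i \sigma_{i+x} \rangle = \tanh^x(\beta_d J)$, valid for a sequence much longer than the correlation length, and the function name and parameter values are hypothetical.

```python
import math

def energy_variance(M, K, bdJ):
    """Variance of U = -K * sum_{j=1}^{M} sigma_j, using <sigma_i sigma_{i+x}> = tanh(bdJ)**x."""
    c = math.tanh(bdJ)
    return K ** 2 * (M + 2 * sum((M - x) * c ** x for x in range(1, M)))

var_random = energy_variance(M=8, K=1.0, bdJ=0.0)       # uncorrelated sequence: M * K^2
var_ferro = energy_variance(M=8, K=1.0, bdJ=1.0)        # clustered nucleotides
var_antiferro = energy_variance(M=8, K=1.0, bdJ=-1.0)   # alternating nucleotides
```

With these illustrative values the variance for clustered nucleotides exceeds the uncorrelated value $M K^2$, while the variance for alternating nucleotides falls below it, in line with the argument above.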

We note that the predicted effect is not restricted to TFs; it applies to any
other kind of DNA-binding protein.

B. Extension of the model to four-letter-alphabet DNA sequences 

In the following, we show that four-letter-alphabet DNA sequences demonstrate qualitatively
similar statistical binding properties to those of the two-letter-alphabet sequences analyzed
above. This will allow us to extend the insights gained from the
analytical model directly to genomic DNA sequences. We argue that the same underlying
physical mechanism controls the non-specific binding propensity in both cases.

In contrast to the two-letter-alphabet DNA sequences, where within our modeling framework
a TF is fully described by the single parameter $K$, in the four-letter-alphabet DNA
case a TF is characterized by four energy parameters, $K_A$, $K_T$, $K_G$, and $K_C$. Although
these energy constants are generally unknown, their order of magnitude can be roughly
estimated as $1\,k_B T$; in addition, we allow the TF-DNA contact energies to fluctuate. We
therefore draw these energies from the Gaussian probability distributions, $P(K_\alpha)$, with zero
mean and standard deviations, $\sigma_\alpha$, where $\alpha = A, T, C, G$; and we average the free energy over
many TF realizations.

The binding energy of a TF at a given site $i$ is:

$$ U(i) = -\sum_{j=i}^{M+i-1} \sum_{\alpha=1}^{4} K_\alpha \sigma_j^\alpha, \qquad (10) $$

where $\sigma_j^\alpha$ is a four-component vector of the type $(\delta_{\alpha A}, \delta_{\alpha T}, \delta_{\alpha C}, \delta_{\alpha G})$ at each DNA position
$j$, with the position of 1 specifying one of four possible identities, (A,T,C,G), of the base-pair
at the position $j$, and with $\delta_{\alpha\beta}$ being the Kronecker delta. The sequence design procedure is
analogous to the one introduced above, Eq. (4), with the 4x4 symmetric matrix of the design
potentials entering the sum, $-J_{\alpha\beta} \sigma_i^\alpha \sigma_{i+1}^\beta$. The results for the average TF-DNA binding free
energy in the ensemble of different TFs are shown in Fig. 3. The key conclusion here is that
the lower the design temperature, $T_d$, provided that in the design procedure nucleotides of
the same type attract (and thus the longer the correlation length of homo-oligonucleotide
stretches), the lower the TF-DNA binding free energy.
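
The four-letter binding energy and the corresponding free energy can be sketched in a few lines of Python. The sketch makes several assumptions that are not specified above: sequences are treated as linear (only full windows of $M$ base-pairs are counted), energies are in units of $k_B T$, and the function names and the default width of the Gaussian are hypothetical choices.

```python
import math
import random

ALPHABET = "ATCG"

def draw_tf(sigma_alpha=2.0, rng=random):
    """Draw contact energies K_alpha (in k_BT) from a zero-mean Gaussian."""
    return {a: rng.gauss(0.0, sigma_alpha) for a in ALPHABET}

def binding_energies(seq, K, M):
    """U(i) = -sum_{j=i}^{i+M-1} K[seq[j]] for every full window of a linear sequence."""
    return [-sum(K[b] for b in seq[i:i + M]) for i in range(len(seq) - M + 1)]

def free_energy(seq, K, M):
    """F / k_BT = -ln sum_i exp(-U(i) / k_BT)."""
    return -math.log(sum(math.exp(-u) for u in binding_energies(seq, K, M)))
```

Looking up `K[seq[j]]` is equivalent to the sum over $\alpha$ with the one-hot vector $\sigma_j^\alpha$, since only the component matching the base-pair at position $j$ is non-zero.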



C. Free energy of non-specific TF-DNA binding in yeast genome 

We ask further the key question: Is the predicted design principle for non-specific TF-DNA
binding operating in a living cell? To answer this question, we computed TF-DNA
binding free energies using yeast genome DNA. Our working hypothesis here is that if the
predicted effect is operational, genomic regions that need to be highly accessible to TFs
should possess a higher propensity for non-specific TF-DNA binding than regions that need
not be highly accessible to TFs. To test this hypothesis we compiled two datasets of genomic
DNA. First, we collected ~1600 high-confidence yeast DNA regulatory promoter sequences
(for organelle organization and biogenesis genes), each sequence 100 nucleotides long. We use
the term 'upstream' to describe this dataset. These upstream sequences are experimentally
known to be highly accessible to TFs. The second dataset involves a comparable number
of weakly accessible genomic sequences. For this purpose, we chose the first 100-nucleotide
stretches of the mRNA coding regions of those organelle organization and biogenesis genes.
We use the term 'downstream' to describe the second dataset. The datasets were compiled
from Ref. [25].



It turns out that upstream sequences demonstrate statistically stronger homo-oligonucleotide
correlations in A and T compared to downstream sequences, whereas the difference
in correlations of C and G is not significant between the datasets. The normalized
correlation function, $C_{AA}(x)$, computed for the sets of upstream and downstream sequences,
respectively, is shown in Fig. 4. This function is defined as $C_{\alpha\alpha}(x) = S_{\alpha\alpha}(x) / \langle S^{r}_{\alpha\alpha}(x) \rangle$,
where $S_{\alpha\alpha}(x) = \langle \sigma_\alpha(i) \sigma_\alpha(i+x) \rangle$, and $\langle S^{r}_{\alpha\alpha}(x) \rangle$ is obtained analogously, using the set of
randomly permuted sequences, averaged with respect to different random realizations. $C_{TT}(x)$
shows qualitatively similar behavior (data not shown).
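
The normalized correlation function defined above can be computed along the following lines. This is a minimal sketch: the function names, the number of shuffles, and the seed are assumptions, and the average in $S_{\alpha\alpha}(x)$ is taken here over all positions of all sequences in the set, which is one plausible reading of the definition.

```python
import random

def pair_frequency(seqs, alpha, x):
    """S_aa(x): fraction of position pairs (i, i+x) where both nucleotides equal alpha."""
    hits = total = 0
    for s in seqs:
        for i in range(len(s) - x):
            hits += (s[i] == alpha and s[i + x] == alpha)
            total += 1
    return hits / total

def normalized_correlation(seqs, alpha, x, n_shuffles=100, seed=0):
    """C_aa(x) = S_aa(x) / <S_aa(x)> over composition-preserving shuffles."""
    rng = random.Random(seed)
    s_rand = 0.0
    for _ in range(n_shuffles):
        # Permute each sequence separately so that its composition is preserved.
        shuffled = ["".join(rng.sample(s, len(s))) for s in seqs]
        s_rand += pair_frequency(shuffled, alpha, x)
    return pair_frequency(seqs, alpha, x) / (s_rand / n_shuffles)
```

Values above 1 indicate that nucleotides of type `alpha` co-occur at distance `x` more often than expected from the composition alone; for example, a hypothetical sequence made of alternating A and T tracts gives `normalized_correlation(seqs, "A", 1)` well above 1.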

We now compare the TF-DNA binding free energies for those two datasets. In order to
remove the compositional bias, for a given TF interacting with a given DNA sequence,
we always compute the difference, $\Delta\mathcal{F}$, between the actual free energy, $\mathcal{F}$, and the free
energy computed for the randomized sequence (preserving the nucleotide composition of each
sequence) and averaged over several random realizations, $\mathcal{F}_\infty$: $\Delta\mathcal{F} = \mathcal{F} - \mathcal{F}_\infty$. We therefore
compute numerically the probability distribution, $P(\Delta\mathcal{F})$, for these two datasets of
sequences, interacting with a model set of TFs. The TF-DNA binding contact energies
are drawn from the Gaussian distributions, $P(K_\alpha)$, as described above. We stress that
the only external parameters entering the model are the standard deviations, $\sigma_\alpha$, of $P(K_\alpha)$.
In our calculations we set $\sigma_\alpha = 2\,k_B T$ for all $\alpha$. The computed $P(\Delta\mathcal{F})$ for upstream and
downstream DNA sequences are shown in Fig. 5A. We also show the cumulative probability
at different values of the selectivity cut-off (Fig. 5B). The central conclusion here is that due
to the presence of enhanced homo-oligonucleotide (i.e. ferromagnetic-like) sequence
correlations, non-specific TF-DNA binding is statistically enhanced. At the maximal selectivity
cut-off, $\Delta\mathcal{F}_c \simeq -0.1\,k_B T$ per base-pair, the probability of TF binding with the free energy
below $\Delta\mathcal{F}_c$ to upstream DNA regions is over 30% higher than to downstream regions
(Fig. 5B). This effect leads to the shift of the thermodynamic equilibrium towards enhanced
occupancy of TFs binding upstream regions rather than downstream regions. The average
strength of the effect on TF occupancy can be estimated from the difference of the average
TF-DNA binding free energies, $\langle \Delta\Delta\mathcal{F} \rangle = \langle \Delta\mathcal{F}_{up} \rangle - \langle \Delta\mathcal{F}_{down} \rangle \simeq -0.1\,k_B T$ per base-pair,
between upstream and downstream DNA regions, respectively (difference between the peak
positions in Fig. 5A). For a TF forming $M$ contacts within the TF-DNA binding site, this
difference will produce an $n_{up}/n_{down} \simeq \exp(0.1 \cdot M)$ shift in the relative binding occupancy,
where $n_{up}$ and $n_{down}$ are the numbers of bound TFs in the upstream and downstream
regions, respectively. For a typical TF forming contacts with 10 DNA base-pairs, this leads to
$n_{up}/n_{down} \simeq 2.7$. We emphasize that the latter estimate provides only a lower-bound limit
for the strength of the predicted correlational effect. We suggest therefore that the predicted
mechanism for enhanced non-specific TF-DNA binding is operational in promoter regions
of a significant fraction of yeast genes.
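
The composition-corrected free energy difference described above can be sketched as follows. The code is a hedged illustration rather than the pipeline used for the genomic analysis: sequences are treated as linear, the numbers of model TFs and shuffled replicas are reduced for speed, and the function names, seed, and the two synthetic test sequences are hypothetical.

```python
import math
import random

def free_energy(seq, K, M):
    """F / k_BT = -ln sum_i exp(-U(i)), with U(i) = -sum of K over an M-base-pair window."""
    return -math.log(sum(math.exp(sum(K[b] for b in seq[i:i + M]))
                         for i in range(len(seq) - M + 1)))

def delta_f(seq, M=8, n_tfs=50, n_shuffles=20, sigma_alpha=2.0, seed=0):
    """Mean over model TFs of (F - F_inf) / M, with F_inf averaged over shuffled replicas."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_tfs):
        # One model TF: Gaussian contact energies in units of k_BT.
        K = {a: rng.gauss(0.0, sigma_alpha) for a in "ATCG"}
        # Reference free energy from shuffles that preserve the nucleotide composition.
        f_inf = sum(free_energy("".join(rng.sample(seq, len(seq))), K, M)
                    for _ in range(n_shuffles)) / n_shuffles
        total += (free_energy(seq, K, M) - f_inf) / M
    return total / n_tfs

# Hypothetical test sequences with identical composition.
tracts = ("A" * 25 + "T" * 25) * 2
alternating = "AT" * 50
```

Because each sequence is compared only with its own shuffled replicas, a negative `delta_f` reflects the arrangement of nucleotides rather than their overall frequencies.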

Finally, we note that our findings suggest the existence of an upper bound for the TF-DNA
binding motif size, imposed by the maximal possible strength of non-specific binding.
It is predicted ll| that if the free energy of TF-DNA non-specific binding falls below $-2\,k_B T$,
this significantly slows down the sliding diffusion of TF along DNA. Our estimates therefore
suggest that such slowing down is likely when the binding motif approaches the size of 20
base-pairs.
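
The two numerical estimates in this subsection follow from simple arithmetic, which the short sketch below spells out. The variable names are hypothetical; the input values are the ones quoted in the text, and the motif-size bound is obtained here by assuming that the per-base-pair free energy reduction accumulates linearly with motif length.

```python
import math

delta_delta_f = -0.1      # k_BT per base-pair, upstream minus downstream (quoted in the text)
M = 10                    # contacts formed by a typical TF (quoted in the text)
occupancy_ratio = math.exp(-delta_delta_f * M)   # n_up / n_down

slowdown_threshold = -2.0                        # k_BT, limit on the non-specific free energy
max_motif_size = slowdown_threshold / delta_delta_f   # base-pairs
```

The first quantity reproduces the relative occupancy estimate and the second the motif size at which the slowing down of sliding is expected to set in.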



III. DISCUSSION AND CONCLUSION 

Here we predicted a generic biophysical mechanism that statistically regulates the strength
of non-specific TF-DNA binding in a genome. We showed analytically and numerically, using
both artificially designed and genomic DNA sequences, that homo-oligonucleotide correlations
statistically enhance non-specific TF-DNA binding affinity. We used the term 'ferromagnetic'
to describe the symmetry of such correlations. Conversely, DNA sequences
possessing enhanced correlations of alternating nucleotides of different types (we termed
such correlations 'anti-ferromagnetic') have a reduced propensity for non-specific binding
to TFs.

Our model description of TF-DNA binding is highly simplified. Yet we suggest that the
design principle for enhanced non-specific TF-DNA binding predicted in this work is likely to
be quite general, that it is operational in genomic locations highly occupied by TFs, and that
it is likely to be the rule rather than the exception. The robustness of our conclusions with respect
to the details of the model stems from the fact that the predicted effect arises exclusively
due to DNA sequence symmetry and its strength (which is determined by the length-scale
of the correlation decay). Computational analysis of the TF-DNA binding free energy in
~1600 yeast genomic DNA regions highly occupied by TFs shows that those regions possess
a much higher propensity for non-specific binding compared with regions depleted in TFs. In
our analysis we used a simple procedure to remove the DNA compositional bias, allowing
us to fairly compare the relative free energies of non-specific binding in different genomic
locations.

We estimated that in yeast, the predicted effect leads to at least $\sim 0.1\,k_B T \approx 60$ cal/mol
free energy reduction (on average) per DNA base-pair in contact with a TF, for DNA
regions with enhanced propensity for non-specific binding. This leads to at least a three-fold
concentration enrichment in TFs (on average) of such highly promiscuous DNA regions in
yeast. We suggest therefore that in addition to all known signals, genomic DNA might also
encode its intrinsic propensity for non-specific binding to TFs. The predicted effect plays the
role of an effective, non-specific localization potential, enhancing the level of one-dimensional
diffusion of TFs along genomic DNA at the genome-wide level, and thus speeding up the
search process for specific TF binding sites [6-11]. We stress that all our conclusions are
obtained assuming a quasi-equilibrium nature of TF-DNA binding. It would be important
to investigate the dynamic aspects of the predicted phenomena.

It is important to note that too high a level of non-specific TF-DNA binding impairs the
overall search efficiency [11, 12]. This suggests that the strength of the predicted effect in
vivo might be subject to both positive and negative regulation. It has been pointed out in a
seminal work of Iyer and Struhl [26] that the activity of poly(dA:dT) tracts increases with their
length. We suggest that this observation is a direct consequence of the effect of enhanced
non-specific TF-DNA binding by poly(dA:dT), predicted here. Another key observation of
Iyer and Struhl [26], that poly(dC:dG) functions in a similar manner to poly(dA:dT), further
strengthens our prediction.

Extensive correlation analysis of different organismal genomes and direct, large-scale mea- 
surements of TF-DNA binding preferences using DNA sequences with the controlled strength 
and symmetry of correlations, should provide an ultimate test of the phenomenon predicted 
here. Protein-DNA binding arrays [3] and high-throughput microfluidics technology al- 
low a direct experimental test of our predictions in vitro. A key experiment would measure 
the TF-DNA binding affinity in different sets of DNA, each set containing DNA sequences 
with a specific TF-DNA binding motif embedded in a background of non-specific sequences 
with a varying symmetry and strength of correlations between DNA sets. We expect that 
DNA sequences with enhanced homo-oligonucleotide correlations in background sequences, 
will generically possess a higher binding affinity to different TFs compared with background 
sequences either lacking such correlations or having correlations with alternating nucleotides 
of different types. 
FIGURE LEGENDS



FIG. 1: Schematic representation of the model for TF binding to DNA, and examples 
of DNA sequence correlation functions. A. Random sequence. B. Enhanced
homo-oligonucleotide (i.e. ferromagnetic-like) correlations lead to statistically enhanced
non-specific TF-DNA binding affinity. C. Enhanced anti-ferromagnetic-like correlations
(alternating nucleotides of different types) lead to reduced non-specific TF-DNA bind- 
ing affinity. All examples of sequences represent simulation snapshots. D. Example 
of the correlation function computed for sequences with enhanced ferromagnetic-like 
correlations; and E. for sequences with enhanced anti-ferromagnetic-like correlations. 
The bold lines represent the exponential decay of the correlation functions. 

FIG. 2: TF-DNA binding free energy difference normalized per one base-pair, $\Delta f =
\beta \langle \Delta\mathcal{F} \rangle / M$, computed using Eq. (7) as a function of the reduced design
temperature, $1/\beta_d J$ (solid curves). The upper and lower branches of the graph correspond to
$J < 0$ (anti-ferromagnetic-like DNA sequence correlations) and $J > 0$ (ferromagnetic-like
correlations), respectively. The results of MC simulations of the system (filled circles) are in
excellent agreement with the analytical results. We used the parameters:
$\beta K = 1$, $M$ = IS, $L = 1000$. In MC simulations we used 7.5 x 10^ MC moves to
design each DNA sequence at each value of $T_d$. In order to generate each point in the
plot we used a set of 100 sequences. In order to compute error bars we divided each
set of 100 sequences into 10 subsets randomly, and then calculated the standard deviation
of the subset averages for $\Delta f$. The error bars correspond to one standard
deviation. The numerically computed quenched average, $-\langle \ln(Z/Z_\infty) \rangle / M$, is also shown
(filled squares). In the computations we used the same parameters and definitions as
specified above. Inset: The same data for $\Delta f$ as in the main figure, but plotted as a
function of $\xi$.

FIG. 3: The average TF-DNA binding free energy, $\Delta f$, numerically computed at different
values of the design temperature, where $\Delta f = -\langle \ln(Z/Z_\infty) \rangle / M$, and $Z_\infty$ is the
partition function for an entirely random DNA sequence. We designed 200 sequences
with the length $L = 400$ at each $T_d$. We performed 5 x 10^ MC steps to design each
sequence, where in each MC step we attempted to exchange two base-pairs chosen
at random. The overall nucleotide composition for each sequence was uniform and
fixed. The design potential was $+J$ (attraction) for identical nearest-neighbor
base-pairs and $-J$ (repulsion) for different nearest-neighbor base-pairs, with $J = 1\,k_B T$.
The contact energies, $K_\alpha$, were drawn from a Gaussian distribution, $P(K_\alpha)$, with zero
mean, $\langle K_\alpha \rangle = 0$, and standard deviation, $\sigma_\alpha = 2\,k_B T$, for each nucleotide type, $\alpha$. We
computed $\Delta f$ as an average over 250 TFs and 200 sequences at each $T_d$, and used
$M = 8$. The error bars are calculated as specified in Fig. 2, and they are smaller than
the marker size.

FIG. 4: The normalized correlation function, $C_{AA}(x)$ (see the text for the definition),
computed for upstream (circles) and downstream (squares) sequence sets. Each set consists
of 1,663 sequences; each sequence is 100 nucleotides long.

FIG. 5: A. The computed $P(\Delta f)$ for 1,663 upstream (dark) and downstream (bright)
yeast genomic sequences, where $\Delta f = \beta \langle \Delta\mathcal{F} \rangle_{TF} / M$, and $\Delta\mathcal{F} = \mathcal{F} - \mathcal{F}_\infty$. For each
given TF, $\mathcal{F}_\infty$ is computed as an average over 50 randomized sequence replicas
(randomization preserves the nucleotide composition of each sequence). For each sequence
we computed $\Delta\mathcal{F}$ for 250 TFs and then took the average of these 250 values, $\langle \Delta\mathcal{F} \rangle_{TF}$.
We used $M = 8$. The TF-DNA contact energies, $K_\alpha$, are drawn from a Gaussian
probability distribution, $P(K_\alpha)$, with zero mean and standard deviation $\sigma_\alpha = 2\,k_B T$, where
$\alpha$ represents four possible nucleotides. Vertical lines show the mean of $\Delta f$. B. The
cumulative probability, $\mathrm{Pr}(\Delta f < \Delta f_0) = \int_{-\infty}^{\Delta f_0} P(\Delta f)\, d\Delta f$, computed using $P(\Delta f)$ from
(A), for upstream (dark) and downstream (bright) sequences, respectively. Inset: The
difference between upstream and downstream $\mathrm{Pr}(\Delta f < \Delta f_0)$.



FIG. 1: [Figure. Panels A, B, and C show example sequence snapshots. A: ...TATTTATATAATATAAATAATTAATAT... B: ...AAATTTTAAAATTTTTTTTTTTT... C: ...ATATATATATATATATATAATATATATAATA...]

FIG. 2: [Plot. Horizontal axis: $\log(1/\beta_d J)$.]

FIG. 3: [Plot of $\Delta f$. Horizontal axis: $\log(1/\beta_d J)$.]

FIG. 5: [Figure.]



